In this document, we carry out a first analysis of missing data patterns in the daily data set. The graphs and figures displayed here are purely informative and are not intended for publication in their current state. Most graphs are produced in loops and therefore share standard, uniform settings. Some fine-tuning would be needed before publishing any specific graph.
First of all, we noticed some issues with the data set and therefore, for now, we restrict it to observations not subject to these issues. In particular, we have no pollution data for NO in 2014, 2015, 2019 and 2020, and very few observations for PMs in 2019. These issues do not reflect actual missing data but stem from problems with the package we used to access the pollution data. The data themselves are available through the EEA interface.
We first analyse the proportion of missing observations for each covariate.
| Variable | Proportion of missing values |
|---|---|
| City | 0.0000000 |
| Concentration | 0.0194582 |
| Elevation weather station | 0.0000093 |
| Global radiation | 0.2730897 |
| Holiday zone | 0.0000000 |
| Insolation duration | 0.2406190 |
| Latitude weather station | 0.0000093 |
| Longitude weather station | 0.0000093 |
| Public holiday | 0.0000093 |
| Rainfall duration | 0.0718760 |
| Rainfall height | 0.0009501 |
| Relative humidity | 0.0004471 |
| School holiday | 0.0000093 |
| Sea level pressure | 0.0000838 |
| Temperature | 0.0000838 |
| UV radiation | 0.8994490 |
| Wind direction | 0.0005775 |
| Wind speed | 0.0005542 |
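The proportions in the table above are simply, for each variable, the number of missing entries divided by the number of observations. As a minimal sketch of that computation, the snippet below uses a handful of made-up records in which `None` marks a missing value; the variable names, the toy values and the `missing_share` helper are illustrative assumptions, not the code or data used to produce this report.

```python
# Toy records (hypothetical): None marks a missing value.
rows = [
    {"temperature": 12.1, "wind_speed": 3.0, "uv_radiation": None},
    {"temperature": None, "wind_speed": 2.5, "uv_radiation": None},
    {"temperature": 9.8, "wind_speed": 4.1, "uv_radiation": 0.7},
    {"temperature": 11.0, "wind_speed": None, "uv_radiation": None},
]


def missing_share(rows):
    """Return, for each variable, the proportion of rows where it is None."""
    variables = sorted({key for row in rows for key in row})
    return {
        var: sum(row.get(var) is None for row in rows) / len(rows)
        for var in variables
    }


shares = missing_share(rows)
for var, share in shares.items():
    print(f"{var}: {share:.2f}")
```

A variable absent from a record is counted as missing here, which is one possible convention among others.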
To better understand these results, we break down the analysis by year and represent the results in a graph, for readability.
In complete case analyses, observations for which any variable is missing are dropped. One may therefore wonder what share of observations a complete case analysis would drop. We only consider variables which are potentially relevant. We also exclude the UV radiation variable due to its large share of missing values.
Carrying out a complete case analysis would lead to dropping 36.7068834% of the observations. However, if insolation duration, global radiation and rainfall duration are also excluded, concentration data becomes the limiting factor.
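To make the complete case logic concrete, here is a small sketch that computes the share of rows dropped for a given set of variables, and shows how that share falls once a frequently missing variable is left out. The records, the variable names and the `dropped_share` helper are hypothetical and only illustrate the idea.

```python
# Toy records (hypothetical): None marks a missing value.
rows = [
    {"concentration": 20.0, "temperature": 12.0, "insolation": None},
    {"concentration": None, "temperature": 10.0, "insolation": 5.0},
    {"concentration": 18.0, "temperature": 9.0, "insolation": None},
    {"concentration": 25.0, "temperature": 11.0, "insolation": 6.0},
]


def dropped_share(rows, variables):
    """Share of rows with at least one missing value among `variables`."""
    dropped = sum(any(row.get(v) is None for v in variables) for row in rows)
    return dropped / len(rows)


# Complete case analysis on all three variables.
share_all = dropped_share(rows, ["concentration", "temperature", "insolation"])
# Same analysis after leaving out the frequently missing variable.
share_reduced = dropped_share(rows, ["concentration", "temperature"])

print(share_all, share_reduced)
```

In this toy example, the only rows still dropped in the reduced set are those with a missing concentration, which mirrors the situation described above.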
The stations considered are located in the 17 largest cities in France:
| City | Number of stations |
|---|---|
| Bordeaux | 4 |
| Clermont-Ferrand | 13 |
| Dijon | 7 |
| Grenoble | 5 |
| Le Havre | 17 |
| Lille | 7 |
| Lyon | 22 |
| Marseille | 11 |
| Montpellier | 7 |
| Nancy | 4 |
| Nantes | 16 |
| Nice | 9 |
| Paris | 29 |
| Rennes | 9 |
| Rouen | 7 |
| Strasbourg | 9 |
| Toulouse | 19 |
Here, we investigate whether the share of missing pollutant concentration data varies across different dimensions.
The overall share of missing air pollution observations is 0.0324046.
In this section, we investigate whether the share of missing values varies with the values of covariates. One may expect that, for extreme values of some covariates, such as temperature, wind speed or precipitation level, measurement instruments are more likely to be defective, leading to more missing values.
One can notice that the share of missing values varies markedly across pollutants: in the table below, it ranges from about 0.4% for NO to about 7% for SO2. This highlights the potential need to analyse missingness patterns separately for each pollutant.
| Pollutant | Proportion missing values |
|---|---|
| no | 0.0037740 |
| no2 | 0.0048729 |
| o3 | 0.0172493 |
| pm10 | 0.0610813 |
| pm2.5 | 0.0522411 |
| so2 | 0.0702874 |
We examine whether missingness patterns vary across location characteristics.
We then investigate whether the share of missing values varies with date and time.
We also explore these patterns more closely for some variables by decomposing them by year, month or pollutant.
We plot the same graphs as before, but only considering the hours where the data started missing, and not the consecutive missing observations that follow.
Compared to the full sample, the share of missing data decreases, since we discarded many observations with missing values (every observation that was not the first of its period of missing data). Hence, the share of missing data is not informative in itself; only potential differences in this share across “grouping variables” are.
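The restriction described above amounts to flagging, within each run of consecutive missing values, only the first one. The sketch below illustrates this on a single made-up series, assumed to be sorted by time and to come from one station and one pollutant; the `onset_flags` helper and the values are hypothetical.

```python
# Toy concentration series (hypothetical), sorted by time; None is missing.
series = [5.0, None, None, 4.0, None, 3.0, None, None, None]


def onset_flags(values):
    """True where a value is missing and the previous one is not missing
    (a missing first value also counts as an onset)."""
    flags = []
    previous_missing = False
    for value in values:
        missing = value is None
        flags.append(missing and not previous_missing)
        previous_missing = missing
    return flags


flags = onset_flags(series)

# Share of missing values in the full sample.
full_share = sum(v is None for v in series) / len(series)

# Share after keeping only observed values and the first value of each run.
kept = [(v, f) for v, f in zip(series, flags) if v is not None or f]
onset_share = sum(f for _, f in kept) / len(kept)

print(flags)
print(full_share, onset_share)
```

The share computed on the restricted sample is mechanically lower than on the full sample, which is why only comparisons across groups are meaningful.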
We first investigate whether covariates are balanced between observations for which concentration data is missing and those for which it is not.
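One common way to summarise balance for a single covariate is the standardized mean difference between the two groups. The source does not state which balance measure is used, so the snippet below is only a sketch under that assumption, with a hypothetical `standardized_difference` helper and made-up values.

```python
from math import sqrt
from statistics import mean, stdev

# Toy data (hypothetical): a covariate and the matching concentration series.
temperature = [10.0, 12.0, 14.0, 20.0, 22.0, 24.0]
concentration = [1.0, 1.0, 1.0, None, None, None]


def standardized_difference(covariate, concentration):
    """Mean of the covariate where concentration is missing, minus its mean
    where concentration is observed, divided by the pooled standard deviation.
    Rows where the covariate itself is missing are ignored."""
    pairs = [(x, c) for x, c in zip(covariate, concentration) if x is not None]
    missing = [x for x, c in pairs if c is None]
    observed = [x for x, c in pairs if c is not None]
    pooled_sd = sqrt((stdev(missing) ** 2 + stdev(observed) ** 2) / 2)
    return (mean(missing) - mean(observed)) / pooled_sd


smd = standardized_difference(temperature, concentration)
print(smd)
```

A value close to zero suggests that the covariate is similarly distributed, at least in mean, in both groups; each group needs at least two observations for the standard deviations to be defined.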
We can refine this analysis by looking at each city separately.
It might also be interesting to see whether covariates have a similar distribution for observations where concentration data is missing and for those where it is not.
One may also be interested in looking at these distributions by pollutant. The results are rather similar across all pollutants. We do not display them, to avoid overcrowding the document even more than it already is.
If data is missing due to external factors, what matters might be the value of these external factors when the data started missing, i.e. potentially when the sensor first became defective. As a consequence, we look into the distribution of the covariates for the last observation before a missing concentration observation.
For concentration, we carry the last value forward in order to see whether concentration values just before the data goes missing differ from those observed when concentration data is not missing. We also filter out high concentration values in order to see the distribution more clearly.
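Carrying the last value forward can be sketched as follows. The series is made up and assumed to be sorted by time for one station and one pollutant; the `carry_forward` helper and the cut-off of 100 used to filter out high values are arbitrary assumptions chosen for illustration only.

```python
# Toy concentration series (hypothetical), sorted by time; None is missing.
series = [None, 5.0, None, None, 250.0, None, 7.0]


def carry_forward(values):
    """Replace each missing value by the last observed value before it.
    Leading missing values stay None since nothing precedes them."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


filled = carry_forward(series)

# Concentration attributed to the missing observations (last value carried).
before_missing = [f for v, f in zip(series, filled) if v is None and f is not None]

# Filter out high values to make the distribution easier to read
# (the threshold below is an arbitrary, hypothetical choice).
THRESHOLD = 100.0
before_missing_filtered = [x for x in before_missing if x <= THRESHOLD]

print(filled)
print(before_missing_filtered)
```

The filtered list can then be compared with the distribution of observed concentrations.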
In this section, we explore the length of periods with missing observations. This length may provide information on the causes of missingness. Observations missing over long periods of time may be indicative of clogged filters or broken instruments. We also explore whether the length of missing periods is correlated with weather variables.
First, we explore the length of missing periods by displaying, in a heatmap, for each city-date pair, whether concentration data is missing. We break this down by year for readability.
Then, we look at the length of periods with missing data. We can either count the number of periods with a given length (e.g. 3 periods have a length of missing data of 5 hours/days) or count the number of dates belonging to periods with a given length (in the same example, 15 dates belong to a period of missing data of length 5 hours/days). We denote the former case “One observation per period” and the latter “One observation per date”.
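The two ways of counting can be illustrated with a short sketch. The series below is made up and assumed to be sorted by time for one station and one pollutant; the `missing_run_lengths` helper is hypothetical.

```python
from collections import Counter
from itertools import groupby

# Toy concentration series (hypothetical), sorted by time; None is missing.
series = [1.0, None, None, 2.0, None, None, 3.0, None, 4.0]


def missing_run_lengths(values):
    """Lengths of the runs of consecutive missing values, in order."""
    return [
        sum(1 for _ in run)
        for is_missing, run in groupby(values, key=lambda v: v is None)
        if is_missing
    ]


lengths = missing_run_lengths(series)

# "One observation per period": number of periods of each length.
per_period = Counter(lengths)

# "One observation per date": number of dates belonging to periods of each length.
per_date = Counter({length: length * count for length, count in per_period.items()})

print(lengths)
print(dict(per_period))
print(dict(per_date))
```

Here two periods have length 2, so four dates belong to a period of length 2: long periods weigh more under the per-date count.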
We might be interested in looking at the length of missing periods for different pollutants. The method used to measure concentration varies across pollutants, and the reasons for missing data may depend on the method. Particulate matter is measured with filters, which can become clogged. This could lead to rather long missing periods, given the time needed to clean the filter. Gaseous pollutants are measured using optical methods and are thus not subject to clogged filters.
As previously, we look at the distributions considering one observation per missing period and one observation per date. The latter case naturally changes the distribution greatly; for instance, one series of missing data of 100 hours/days is counted only once in the former case but 100 times in the latter.
In this section, we investigate whether the length of periods of missing data varies with weather variables. Due to the larger number of observations considered here, instead of looking at a scatter plot, we look at bivariate distribution plots.
Then, we look into the values of weather variables at the time the data started missing. If missingness is caused by some weather feature, the weather at the time of the first missing observation is the one to look into.